This notebook contains the final report for the project of the course A Network Tour Of Data Science.

Project : Course Suggestor

Assistant : Michaël Defferrard

Students : Cherqui Alexandre, Gusset Frédérick, El Hamamsy Laila, Triquet Thomas



In [1]:

    
import pandas as pd
import networkx as nx
import os
import pickle

1. Introduction

Once the master level is reached, students have the opportunity to choose between a vast pool of courses, as well as courses which are not part of their original study plan. However, aside from going through all the coursebooks or asking other people's opinions, there is nothing to help students select courses which may be relevant to their domains of interest. That is why the goal of this project was to work on a course suggester where the student could input the name of a course of interest or a list of courses taken and obtain a list of courses which are related.

To do so a graph where the nodes are the courses and the links are given by features was created by recovering the necessary information from the different EPFL databases. The idea was to make use of the list of courses that students took as well as the course descriptions to weight the edges. From the latter multiple attributes were recovered: the professor, keywords, pre-requisites, teaching methods, assessments, the faculty to which the course is attached and so on. All of this information was public either via is-academia or the public access.

The courses each student took between 2007 and 2016, their affiliation (MT, EL, IN...), the current semester (BA1, BA2,...) as well as an XML with the course descriptions of the courses given in 2016 were provided by Francisco Pinto who already distributed this dataset during the DataJam days and can be found on the dropbox link here. Therefore it was not necessary to do any web-scraping but there was a large portion of data cleaning and formatting before being able to construct the different graphs. It is important to note that the data is anonymous so there is no way to trace back any information to a particular student.

The recommendation system was based on two main methods seen in the course : spectral clustering and graph diffusion.

2. Data Cleaning

The data cleaning was done in the Data Cleaning notebook, for any additional details regarding what was done in this section, please refer to the original notebook. In this notebook both the enrollements and the course descriptions were cleaned in order to have the information under a format which was more appropriate for the graph constructions. Non essential fields were also removed or merged the ones which were deemed to be redundant.

2.1. The Enrollements

2.1.1 Identifying relevant fields

This data is comprised of multiple fields : 'CourseEnrolmentID', 'PersonID', 'SubjectID', 'Year', 'Semester', 'Session', 'SubjectName', 'LevelName', 'TeachingLanguage', 'StudyDomain', 'SectionName', 'SectionCode', 'StudyPlanName', 'StudyPlanCode', 'CourseCodes'

The only fields which were of interest to were the:

CourseCodes: unique indentifier for each course which is the same in both the enrollments and the course descriptions databases. The advantage of course codes is that they remain the same, regardless whether the course changes its name or not.
PersonId: unique identifier for each student
Year : the year the course was taken by the student, this was important to define the number of years in which a course was given to have a normalizing factor which accounts for courses which may be newer.
StudyPlanCode : the study plans to which the course is affiliated
SectionCode : the section code which the student is affiliated.

All other fields were dropped.

2.1.2 Unifying Section Codes

The sections codes were then unified to keep only the main section attribute (EL, MT, GM, etc...), removing any other extention such as _HES, _ECH, etc...

2.1.3 Unifying Study Plan Codes

Study plan codes also needed to be unified as semestre automne and printemps were often input instead of BA1 and BA2 for example. To do the course code which indicates the year was used. Therefore a course given in 4th year, semestre automne would be MA1. A course given in 3rd year semestre de printemps would be BA6.

2.1.4 Handling Cases Where Students Failed Courses And Followed Them Multiple Times

To avoid high weights due to courses which were taken multiple times by the same students, only the record of their first enrollement was kept.

2.1.5 Removing Irrelevant Courses

All students following internships or in the CMS or first year were removed as these were irrelevant to the application. Courses that no longer exist were also removed as there was no point in recommending them. The social science courses were also dropped as they are not linked to other courses in the study plans.

2.1.6 Final DataFrame

Given all these modifications the following dataframe was obtained.



In [2]:

    
enrol = pd.read_pickle("../data/cleanedAR_enrol_initial.pickle")
enrol.head()









    Out[2]:







  
    
      
      PersonID
      SubjectID
      Year
      Semester
      SubjectName
      SectionName
      StudentSectionCode
      StudyPlanCode
      CourseCodes
      CourseSection
      CourseID
    
  
  
    
      0
      58e958ea4531b5624c0a827ed0b139ce
      62918990845509bd69e684af98230cf2
      2007-2008
      Master semestre 1
      Optics III
      Physique
      PH
      MA3
      PHYS-449
      PHYS
      449
    
    
      1
      ae7eb0574297b5e692272779cd2f6fa0
      62918990845509bd69e684af98230cf2
      2007-2008
      Master semestre 1
      Optics III
      Physique
      PH
      MA3
      PHYS-449
      PHYS
      449
    
    
      2
      43bd48d5f43035da625c4db1945307b2
      62918990845509bd69e684af98230cf2
      2007-2008
      Master semestre 1
      Optics III
      Physique
      PH
      MA3
      PHYS-449
      PHYS
      449
    
    
      3
      6d1316f513268449837932316ea0d696
      62918990845509bd69e684af98230cf2
      2007-2008
      Master semestre 1
      Optics III
      Physique
      PH
      MA3
      PHYS-449
      PHYS
      449
    
    
      4
      f3562b61dc25a46c900bd0b2d078cc91
      62918990845509bd69e684af98230cf2
      2007-2008
      Master semestre 1
      Optics III
      Physique
      PH
      MA3
      PHYS-449
      PHYS
      449

2.2. Course Descriptions

This data was more complex than the enrollments as it contained more fields which were not standardized. The course descriptions are filled out by professors and, although there is a similar structure in all of them, there was a large portion of data cleaning and merging which had to be done.

2.2.1 Identifying relevant fields

The fields provided in this database are the following : 'AcademicYear', 'CourseCode', 'CourseTitleFR', 'CourseTitleEN', 'ProfessorSCIPERs', 'AssistantSCIPERs', 'StudyPlansFR', 'StudyPlansEN', 'Summary_FR', 'Summary_EN', 'KeyWords_FR', 'KeyWords_EN', 'Content_FR', 'Content_EN', 'ImportantConcepts_FR', 'ImportantConcepts_EN', 'MandatoryPrerequirements_FR', 'MandatoryPrerequirements_EN', 'IndicativePrerequirements_FR', 'IndicativePrerequirements_EN', 'PreparesFor_FR', 'PreparesFor_EN', 'Handouts_FR', 'Handouts_EN', 'Bibliography_FR', 'Bibliography_EN', 'TeachingMethod_FR', 'TeachingMethod_EN', 'EvaluationMethod_FR', 'EvaluationMethod_EN', 'ExpectedWork_FR', 'ExpectedWork_EN', 'Other_FR', 'Other_EN', 'ProjectManagement', 'WorkInSociety', 'PersonalEffectiveness', 'TrainingAchievements', 'WorkInGroupsAndOrgs', 'CommunicateInformation'

Many fields are provided twice, once in english and once in french which requires merging. There are also fields which are not relevant to the aplication such as the teaching method, evaluation method, expected work, project management, work in society etc...

That is why only the ones which were deemed as being of interest when constructing the final graph were kept :

'CourseCode' : unique indentifier for each course which is the same in both the enrollments and the course descriptions databases
'CourseTitle' in french and english: for string matching to match course names to course codes in requirements and prepares for
'ProfessorSCIPERs' : to construct a professor's graph
'AssistantSCIPERs' : to see in an assistant's graph could be created as the assitant's sciper do not always figure in the course descriptions
'StudyPlans' in french and english : to link each couse to the official study plans it belongs to
'Summary', 'Content', 'ImportantConcepts', 'KeyWords' in french and english : to have a database of words which are linked for the topics graph
'MandatoryPrerequirements', 'IndicativePrerequirements', 'PreparesFor' in french and english : to construct the pre-requirements and prepares for graph

2.2.2. Removing Irrelevant Courses For The Application

The social science courses were removed as they are elective courses which are often unrelated to the rest of the studies. Other courses were removed later on following the analysis conducted on the students' graph (also called enrolment's graph) in section 3.1. and the study plans' graph in section 3.2.

First and second year courses were also removed as most sections do not give the possiblity of choosing elective courses at this level. Furthermore most of these courses are basic polytechnic courses which are not highly specialized. SHS courses were also removed as they werre considered irrelevant for the course suggestor. Note that the focus of the recommendation system was finally placed on the master courses as that is where students have the most choice.

2.2.3. Simplifying the Study Plans

In the same way as for the course enrollements, the study plans were simplified to keep only the main categories : 'AR', 'Area and Cultural Studies minor', 'Auditeurs en ligne', 'CGC', 'Computer engineering minor', 'EL', 'EL MNIS', 'GC', 'GM', 'HD', 'Hors plans', 'IF', 'IN', 'Information security minor', 'MA', 'MES', 'MT', 'MTEE', 'MX', 'Mineur : Biocomputing', 'Mineur : Biotechnologie', 'Mineur : Design intégré, architecture et durabilité', 'Mineur : Développement territorial et urbanisme', 'Mineur : Informatique', 'Mineur : Management de la technologie et entrepreneuriat', 'Mineur : Neuroprosthétiques', 'Mineur : Neurosciences computationnelles', 'Mineur : Systems Engineering', 'Mineur : Systèmes de communication', 'Mineur : Technologies biomédicales', 'Mineur : Technologies spatiales', 'Mineur : Énergie', 'Mineur STAS', 'PH', 'SC', 'SHS', 'SIE', 'SV', 'UNIL'

2.2.4. Unifying and Merging Requirements

The requirements were not necessarily provided under the form of course codes. Therefore the course codes needed to be indentified from the string and correctly formatted if present, otherwise they needed to be inferred from the course name provided, knowing that it was not necessarily exact. For the first part regex was used to identify course codes in strings. For the second part string edition distance was used in order to best match the course name provided to the official course name and then map it to the corresponding course code.

2.2.5. Formatting keywords, contents, summaries and concepts

To compute a distance between courses based on keywords, the first thing which needed to be done was to obtain a set of terms per course. These terms were given by the keywords, contents, summaries and concepts which need to be unified and lemmatized before processing.

The resulting dataframe can be seen below.



In [3]:

    
courses = pd.read_pickle("../data/cleanedAR_courses_initial.pickle")
courses.head()









    Out[3]:







  
    
      
      AcademicYear
      CourseTitleFR
      CourseTitleEN
      ProfessorSCIPERs
      AssistantSCIPERs
      KeyWords_FR
      KeyWords_EN
      StudyPlans
      Requirements
      Summary_Concepts_Contents_EN
      Summary_Concepts_Contents_FR
    
    
      CourseCode
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      AR-301(d)
      2016-2017
      Théorie et critique du projet BA5 (Fröhlich M....
      Studio BA5 (Fröhlich M. & A.)
      225189;225190
      NaN
      [sculpture et expression, habitat et commerce,...
      [tradition and modernity, location and context...
      AR
      []
      [the atelier will examine the programmatic and...
      [l'atelier examinera le programme et l'organis...
    
    
      AR-302(d)
      2016-2017
      Théorie et critique du projet BA6 (Fröhlich M....
      Studio BA6 (Fröhlich M. & A.)
      225189;225190
      NaN
      []
      []
      AR
      []
      [the atelier will examine the programmatic and...
      [l'atelier examinera le programme et l'organis...
    
    
      AR-321
      2016-2017
      Théorie de l'architecture V
      Theory of architecture V
      263770
      NaN
      [théorie de l'architecture, société, histoire,...
      [theory of architecture, modernity, internatio...
      AR
      [AR-322, AR-321]
      [fluidity, atmosphere, political, the course o...
      [le but est d'interpréter et de critiquer de t...
    
    
      AR-322
      2016-2017
      Théorie de l'architecture VI
      Theory of architecture VI
      263770
      NaN
      [théorie de l'architecture, société, histoire,...
      []
      AR
      [AR-321]
      [the course offer an overview of architectural...
      [le but est d'interpréter et de critiquer de t...
    
    
      AR-323
      2016-2017
      Histoire de l'architecture V
      History of architecture V
      178030
      NaN
      []
      []
      AR
      []
      [the course aim to provide critical tool for a...
      [avec un accent particulier sur le théories de...

3. Graph Construction and Analysis

Given the information collected from the cleaned dataset, different graphs were constructed and analyzed to find the relations between the different courses, and thus the influence of all these graphs. The goal was to link all courses with different manners to represent all interests from students. In the end, a final graph was constructed as a weighted sum of the the most relevant graphs, taking into account the relative importance of the information provided by each.

The main source of information was given by the student themselves and the courses they took, as it was information given over several years (since 2007). Being the most important piece of information, this graph became the baseline for the analysis and the recommendation system.

Other important pieces of information were considered, taking into account the fact that they had to link courses in a way that brought another type of connection, useful for students that want to discover something other than the mandatory courses from their section.

This was provided by the professors, the assistants (if given), the topics (keywords of a course) and pre-requirements (and in the other way, the preparing for courses) which were constructed in the Graph Creation notebook (enrolments, study plans, professors and assistants), the Graph Requirements notebook and the Topics Graph notebook.

Note that different graphs were created using a generic function from the graph module and the analysis was conducted in the Graph Analysis notebook.

3.1. Student's Graph

This graph was constructed while keeping in mind the fact that two courses should be connected if a student has taken them both over their curriculum. To avoid giving high weights to courses which were given over a large number of years, the weights were normalized by taking into account the number of years the courses co-existed when increasing the weights. This was done in the Graph Creation notebook and the figures generated in the Graph Analysis notebook.

The following figures correspond to the weight matrix and the weighted degree distribution of the network.

There are a few clusters which seem to appear as can be seen by the block diagonals which appear in certain areas.

Looking at the weights distribution there are a few courses which are highly connected with large degrees.

The graph of the Bachelor and Master courses for all sections combined was then plotted on the eigenvectors of the normalized Laplacian and analyzed using pyGSP.

Visually, it would seem that there are certain groups which can be distinguished by looking at the branches formed. However only one big component remains as can be seen below (the size of the giant component is equal to the number of nodes).



In [4]:

    
# Load the weight matrix
pkl_file = open(os.path.join(os.getcwd(), 'Graphs','students_graph_with_AR.pkl'), 'rb')
students_mat = pickle.load(pkl_file)
pkl_file.close()

# Create the graph and find the giant component.
G1=nx.from_numpy_matrix(students_mat)
Gcc=sorted(nx.connected_component_subgraphs(G1), key = len, reverse=True)[0]

# Compare the giant component to the amount of nodes.
print("The size of the giant component is "+str(len(Gcc)))
print("The number of nodes in the graph is "+str(len(students_mat)))









    



The size of the giant component is 682
The number of nodes in the graph is 682

Using the normalized Laplacian and observing the graph on the first two eigenvectors, the graph constructed from the enrolments forms several clusters that are more or less distant. The results where the different sections and faculties are highlighted can be seen in the following figures. Note that only a few outputs are shown here, all other graphs can be seen in the original notebook

The first thing to see is that one big cluster is distinguishable from the others. This corresponds exatcly to the courses contained in the AR study plan (or more specifically, with a link with AR section), as can be seen after highliting the courses based on the study plans.

There is another distinct cluster, although less evident due to the scale imposed by the presence of AR. This ones corresponds to GC and SIE courses which occupy each one portion of the same branch. One thing to add is that half courses from SIE are not in this cluster because they are implicated in other study plans as well, that have a bigger influence.

Thus, it is possible to say that the ENAC faculty (apart from some SIE courses) has few, or weak, connections with the rest of the faculties. Students from one side has then less chance to take courses from the other side. It was then decided that the ENAC faculty could be dropped from the other graphs too.

Certain sections are well grouped into one distinct branch such as Systèmes de Communications whilst others are more dispersed such as MT, due to it's interdisciplinary nature and overlap with other STI sections.

Finally, the graphs show that all distinct clusters correspond to faculty, section, or even orientations within the different master curriculums. Note however that there are certain faculties which are more disperse than others. The IC faculty occupies its own branch whilst the ENAC faculty is quite disperse but relatively separable from the rest. The STI faculty on the other hand forms one big cluster at the center.

It is important to remark that minors are spread out over all sections as they are accessible to all students.

Generally, this demonstrates that students have a tendency to take courses provided by their sections and are less inclined to take others. This is most likely due to the fact that certain sections do not allow students to take many courses outside their study plans as well as misinformation or ignorance. This is where the current project can be useful which is why information linking courses accross the different sections should be incorporated.

3.2. Study Plan's Graph

This graph was constructed in the Graph Creation notebook and the figures generated in the Graph Analysis notebook.

In this graph, two courses are connected if they are part of the same study plan. This was compared to the enrolments graph as it bore the largest similarity.

There seems to be a high correlation between all courses albeit architecture courses. This confirms what was observed previously. Note that the different faculties and sections are less distinguishable on the eigenmaps using the study plans but maintain similar configuration as in the enrolments graph as can be seen below.

The projection on the first two eigevnectors of the normalized Laplacian show two main clusters which are orthogonal in their correlations.

The farthest cluster is the architecture cluster as mentioned previously.

The SIE and GC clusters are once again located on one side and relatively grouped together.

This time the SC is not in a neatly distinguishable branch.

Not surprisingly, MT overlaps once again over a portion of the graph.

However, using a spring view, the different study plans are more visible with the different components corresponds relatively well to the different sections.

The graph is connected (no subgraph) and has only one big component as can be seen below. The information is therefore redundant with the baseline (enrolments' graph)which is why it was not kept.



In [5]:

    
# Load the weight matrix
pkl_file = open(os.path.join(os.getcwd(), 'Graphs','section_graph_with_AR.pkl'), 'rb')
students_mat = pickle.load(pkl_file)
pkl_file.close()

# Create the graph and find the giant component.
G1=nx.from_numpy_matrix(students_mat)
Gcc=sorted(nx.connected_component_subgraphs(G1), key = len, reverse=True)[0]

# Compare the giant component to the amount of nodes.
print("The size of the giant component is "+str(len(Gcc)))
print("The number of nodes in the graph is "+str(len(students_mat)))









    



The size of the giant component is 682
The number of nodes in the graph is 682

⚠️ ⚠️ ⚠️ In order to be able to judge of the pertinence of the recommender system and given the nature of the baseline graph (i.e. enrolments) which clustered the faculties relatively well, the decision was taken to do the recommendations only for the STI faculty. Only master courses were considered as this is where the students have the most flexibility and the least guidance. Therefore the following graphs are shown only for courses taken by students in the STI faculty during the master curriculum. ⚠️ ⚠️ ⚠️

3.3. Professor's Graph

In this graph, two courses are connected if they were teached by the same professor. This graph was considered as professors tend to teach in similar domains therefore giving potential continuity in the curriculum and the domain of interest. The graph shown is the one corresponding to the master courses in STI. Refer to the Graph Creation notebook for the creation of the graph and the Graph Analysis notebook for the generation of the different figures.

Unfortunately this graph is quite sparse as can be seen in by observing the weight matrix leading to many disconnected subgraphs.

The graph is not made up of one or several large component but rather a multitude of small components or isolated nodes.

In the following graphs, the degree distribution and the distribution of giant components can be seen respectively.

Most nodes are of low distribution with 3 or less connections. Very few have high distributions and those that do are due to multiple professors hosting the same course and giving other courses with multiple professors. The node with the highest degree is the Lab in EDA design which is linked to :

Analog circuits design I,
Hardware systems modeling I,
Analog circuits design II,
Advanced VLSI design,
IC design II,
Hardware systems modeling,
Fundamentals of VLSI design,
Introduction to VLSI Design

Most components are made up of less than 5 nodes with very few exceeding that value. The largest component is of size 16, regrouping professors from STI and SV, and corresponds to the following courses:

Sensorimotor neuroprosthetics,
Analysis and modelling of locomotion,
Nanotechnology,
Advanced machine learning,
Applied machine learning,
BioMEMS,
Materials and technology of microfabrication,
Robotics practicals
Technologie des microstructures I
Sensors in medical instrumentation
Mobile robots
Scaling laws in micro- and nanosystems
Brain computer interaction
Capteurs
Fundamentals of neuroengineering
Seminar in physiology and instrumentation

These courses have very little link between them. The more hops there are between the courses the less they are related. Given that as well as the fact that this graph is very sparse, this graph was not considered for the final recommender system.

3.4. Assistants Graph

Doing the same for the assistants and plotting for the STI faculty, the results obtained are even less convincing given the even sparser nature of the weight matrix. This is due to the fact that professors do not necessarily input the assistant scipers into the course descriptions. Most likely the doctoral assistants who participate usually in just one or two courses are input but not the student assistants which participate in a large panel of courses. Refer to the Graph Creation notebook for the corresponding code

Weight matrix

Visualization of the graph using networkx

Unweighted degree distribution

Distribution of the size of components in the graph

3.5. Topics' Graph

This graph was constructed by taking various textual elements in the course descriptions, in particular the keywords, the summaries, contents and concepts as these relate to the topics of the course. The idea was to extract keywords from the various course descriptions and link them using topic detections methods.

The first step was to format the text and lemmatize it to reduce the words to simpler forms, that way for example plural terms will be reduced to their singular form and they will be indentified as the same. This was done using the NLTK package. Stopwords were also removed as they are often irrelevant to the topics themselves.

Once the corpus of words was created for all the different courses, the TFIDF was computed. The TFIDF is defined as the "term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus" (Wikipedia page). Therefore the most relevant words in the corpus will be those that are the most unique and as such those that will better describe the topics. This was done using the TfidfVectorizer from the scikit learn package.

Then, given a matrix of Courses x Words, matrix factorization was used in order to determine the various topics and obtain a Courses x Topics matrix. A weight matrix between the different courses was then computed. This was done using Latent Dirichlet Annotation (LDA) from the scikit-learn library.

Unfortunately the topic detection was not functional for multiple reasons, the main one being that there were very little samples compared to the number of keywords to be analyzed. The results obtained in the Topics Graph - Keywords, Summaries, Concepts notebook were the following :


Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=196 and n_features=1919...

Topics in LDA model:
Topic #0: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #1: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #2: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #3: 6ghz aberration ability able abstract acoustic acquaint acquire acquires acquisition active act activity actuator add addition address adaptive advance afm
Topic #4: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #5: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #6: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #7: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #8: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #9: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend

As can be seen, the topics are nearly all identical which is suprising given the low number of topics to be extracted. That is why this graph was not used although it could have offered links between courses related to topics without relying on a professor actually defining the link.

3.5. Pre-Requirements and Prepares for Graph

In order to take certain courses, students may need to have specific background skills that they acquired by following other courses. These ones are called pre-requirement courses. It may be interesting to link courses based on this aspect. Once the requirements were linked to the course codes, it was possible to construct 3 graphs based on this feature (this was done in the Graph Requirements notebook):

The first graph links a course to its requirements : a student may be interested by a course that require the same skills he developped by following other lectures. As can be seen in the following figures showing the weight matrix and the degree distribution, the graph is sparse as some courses do not have requirements listed or have generic requirements which are not given by a specific course.

The second links the requirements of the same course together : If different courses are required for the same lecture, they may present the same topics. This graph contains comparatively more information as can be seen by the weight matrix which is less sparse and the distribution of degrees which decreases less quickly. It will be shown that this graph is the best one where the requirements are concerned when speaking about the graph diffusion and combining the different graphs.

The third graph that links courses which have the same requirements together : If different courses require that a student to have a particular set of skills, they may also present the same topics. This graph is also very spase. Similarly as in the case of the first graph, looking at the weight distribution, the degrees are essentially concentrated around 1 and decrease rapidly.

Overall, these graphs are quite sparse which is most probably due to two factors :

The first is that the requirements are enterred manually and do not necessarily correspond to a specific course or even a course at all but rather a topic. There are even certain course descriptions which provide generic requirements, such as knowledge in algebra and calculus. Therefore there are certain requirements which may exist without being linked and therefore are not represented in the graphs.
The second is that the courses on the master level are highly specific and may not actually have direct links between them.

4. Recommender System

4.0. For Master Students in the STI Faculty

As mentioned previoulsy, the decision was taken to focus solely courses taken by students in the STI faculty over their curriculum. This was done on the baseline graph with the enrolments with the idea of enriching it using the different requirements graph, following the analysis in the previous section. Note that the following images were generated in the Graph Analysis notebook

Graph visualized along the first two eigenvectors of the normalized Laplacian

Weight matrix : it is interesting to see that the weight matrix is nearly block diagonal with 5 blocks corresponding to the 5 sections in the STI master curriculums.

Unweighted degree distribution : the degree distribution is nearly symmetrical. Note that the course with the highest degree is the Space Mission Design and Operations course which is a popular course amongst students as it is given by Claude Nicollier who talks about his adventures in space. Other courses with high degrees often correspond to other appreciated courses such as this one or mandatory courses. The latter is problematic in the sense that it may bias the recommender system towards courses that students must take rather than courses that students appreciate.

Highlighting the different sections as was done in the analysis, it can be shown that the MX, GM and EE sections have their own branches whilst the MT, BIOENG and a portion of EE are dispersed over all the sections.

Using the resulting graph, the objective was therefore to recommend the most relevant courses using notions of spectral clustering and graph diffusion seen in the course. One could have imagined simply taking the neighbours of a course of interest but this would not have lead to a coherent set of courses such as what would be obtained using clustering or the optimal list of courses provided multiple inputs.

4.1. Unsupervised Clustering : K-Means and GMM

This portion of the project was done in the Spectral Clustering notebook.

4.1.1. Hard Clustering With K-Means

The objective here was to see whether it would be possible to isolate a given number of coherent clusters which could be used as a basis for recommendation. What needed to be done was to select the number of clusters k. The method presented in the course to select the optimal k was based on the spectral gap. Unfortunately, this relies on the fact that there are clusters which can more or less easily be separated. In the present case, there is one giant component which is highly linked for the STI faculty which is why the output is k = 1.

Eigenvalues of the normalized Laplacian of the baseline graph

Another method presented in the Applied Data Analysis course is to base the selection of k on the silhouette. The silhouette is defined by :

$$ S = mean(\sum_{i}^{N} s(i)) = mean(\sum_{i}^{N} \frac{b(i) - a(i)}{\max{a(i), b(i)}})$$

where :

$a(i)$ is the distance to the own centroid
$b(i)$ is the distance to the closest other centroid
$S$ is the average of the silhouette at each data point i

Computing the silhouette and selecting the value of k which gives the maximal output S is the best result.

The optimal value is therefore k = 8 and the results of the clustering can be seen in the following figure.

The clusters obtained are the following and are coherent with personal knowledge of the faculties :

Cluster 1 related to material science :[['Mécanique vibratoire'], ['Conception pour X'], ['Composants semiconducteurs'], ['Technologie des microstructures I'], ['Biomaterials (pour MX)'], ['Advanced nanomaterials'], ['Tribology'], ['Composites technology'], ['Thin film fabrication processes'], ['Life cycle engineering of polymers'], ['Cementitious materials (advanced)'], ['Matériaux pierreux et conservation du patrimoine bâti'], ['Powder technology'], ['Micro and nanostructuration of materials'], ['Materials selection'], ['Recycling of materials'], ['Assembly techniques'], ['Wood structures, properties and uses'], ['Polymer morphological characterization techniques'], ['Physical chemistry of polymeric materials'], ['Properties of semiconductors and related nanostructures'], ['Dielectric properties of materials'], ['Electron microscopy: advanced methods'], ['Organic electronic materials - synthesis, applications, properties'], ['Polymer chemistry and macromolecular engineering'], ['Atomistic and quantum simulations of materials'], ['Electrochemistry for materials technology'], ['Introduction to crystal growth by epitaxy'], ['Fundamentals of solid-state materials'], ['Soft matter'], ['Modelling problem solving, computing and visualisation I'], ['Fracture of materials']]

Cluster 2 related to micro-electronics: [['Nanoelectronics'], ['Analog circuits design I'], ['HF and VHF circuits and techniques I'], ['Hardware systems modeling I'], ['Advanced analog and RF integrated circuits design I'], ['Test of VLSI systems'], ['Physical models for micro and nanosystems'], ['Modeling of emerging electron devices'], ['HF and VHF circuits and techniques II'], ['Integrated circuits technology'], ['Advanced analog and RF integrated circuits design II'], ['Analog circuits design II'], ['Advanced VLSI design'], ['Advanced lab in electrical engineering'], ['Advanced wireless communications: algorithms and architectures'], ['Electrical filters'], ['Lab in EDA based design'], ['Hardware systems modeling'], ['Fundamentals of VLSI design'], ['Bio-nano-chip design'], ['Lab in microelectronics'], ['Analog circuits for biochip'], ['Introduction to VLSI Design']]

Cluster 3 related to fluid dynamics and the mechanical engineering cursus : [['Two-phase flows and heat transfer'], ['Turbomachines thermiques'], ['Hydraulic turbomachines'], ['Numerical flow simulation'], ['Fracture mechanics'], ['Dynamique numérique des solides et des structures'], ['Numerical methods in heat transfer'], ['Hydrodynamique acoustique'], ['Hydrodynamics'], ['Modelling and optimization of energy systems'], ['Production management'], ['Advanced energetics'], ["Cavitation et phénomènes d'interface"], ['Engines and fuel cells'], ['Renewable energy (for ME)'], ['Thermal power cycles and heat pump systems'], ['Particle-based methods'], ['Aéroélasticité et interaction fluide-structure'], ['Aerodynamics'], ['Lifecycle performance of product systems'], ['Mechanics of composites'], ['Introduction to nuclear engineering'], ['Advanced heat transfer'], ['Instability'], ['Turbulence'], ['Composites polymères + TP']]

Cluster 4 related to the Optics and MNS orientations in the micro-engineering cursus : [['Optical detectors'], ['Materials and technology of microfabrication'], ['Microelectronics'], ['Lasers: theory and modern applications'], ['Scaling laws in micro- and nanosystems'], ['Analyse de produits et systèmes'], ['Optics laboratories I'], ['Laser microprocessing'], ['Fundamentals & processes for photovoltaic devices'], ['Integrated optics'], ['Large-area electronics: devices and materials'], ['Image optics'], ['Advanced MEMS'], ['Optical waves propagation'], ['Nanotechnology'], ['Advanced machine learning'], ['Circuits intégrés I'], ['IC design II']]

Cluster 5 related to the information technologies cursus in EE :[['Automatic speech processing'], ['Propagation of acoustic waves'], ['Microwaves'], ['Systems and architectures for signal processing'], ['Image and video processing'], ['Rayonnement et antennes'], ['Audio'], ['Optical communications'], ['Photonic systems and technology'], ['Media security'], ['Image communication'], ['Image analysis and pattern recognition'], ['Speech processing'], ['Biomedical signal processing'], ['Brain computer interaction'], ['Wireless receivers: algorithms and architectures'], ['Social media'], ['Lab in signal and image processing'], ['Lab on app development for tablets and smartphones'], ['Lab in microwaves'], ['Lab in acoustics'], ['Mathematics of data: from theory to computation']]

Cluster 6 related to the bio-engineering cursus :[['Biomechanics of the musculoskeletal system'], ['Biomechanics of the cardiovascular system'], ['Commande de procédés'], ['Numerical methods in biomechanics'], ['Mécanique des structures (pour GM)'], ['Image processing I'], ['BioMEMS'], ['Biomicroscopy I'], ['Biomicroscopy II'], ['Seminar in physiology and instrumentation'], ['Signal processing for functional brain imaging'], ['Flexible bioelectronics'], ['Sensors in medical instrumentation'], ['Fundamentals of biosensors and electronic biochips'], ['Data analysis and model classification'], ['Introduction to nanomaterials'], ['Materials Science'], ['Biomedical optics'], ['Biotechnology lab (for CGC)'], ['Biomaterials'], ['Introduction to cellular and molecular biotechnology'], ['Pharmaceutical biotechnology'], ['Chemical biology - tools and methods'], ['Advanced bioengineering methods laboratory'], ['Lab methods : histology'], ['Lab methods : bioactive compounds screening'], ['Lab methods : animal experimentation'], ['Lab methods : biosafety'], ['Lab methods : flow cytometry'], ['Lab methods : proteomics'], ['Fundamentals of neuroengineering'], ['Tissue engineering'], ['Sensorimotor neuroprosthetics'], ['Analysis and modelling of locomotion'], ['Scientific literature analysis in Neuroscience'], ['In Silico neuroscience']]

Cluster 7 related to the energy specialization in EE : [['Power systems dynamics'], ['Industrial electronics I'], ['Industrial electronics II'], ['Power system restructuring and deregulation'], ['Wave propagation along transmission lines'], ['Multivariable control and coordination systems'], ['Electromagnetic compatibility'], ['Hydropower plants: generating and pumping units'], ['Semester project for specialisation C'], ['Smart grids technologies'], ['Energy storage systems'], ['Advanced lab in electrical energy systems']]

Cluster 8 related to the robotics specialization in the MT cursus : [['System identification'], ['Methods for rapid production and development'], ['Fabrication assistée par ordinateur'], ['Commande non linéaire'], ['Computer-aided engineering'], ['Multi-body simulation'], ['Simulation and optimisation of industrial applications'], ['Advanced control systems'], ['Systèmes mécatroniques'], ['Conception mécanique intégrée'], ['Model predictive control'], ['Robotics practicals'], ["Commande d'actionneurs à l'aide d'un microprocesseur + TP"], ['Mobile robots'], ['Image processing II'], ['Capteurs'], ['Bases de la robotique'], ['Robotique industrielle et appliquée'], ['Haptic human robot interfaces'], ['Applied machine learning'], ['Spacecraft design and system engineering'], ['Transducteurs et entraînements intégrés'], ['Conception de mécanismes spatiaux : introduction'], ['Lessons learned from the space exploration'], ['Space mission design and operations'], ['Spacecraft avionics architectures'], ['Introduction to planetary sciences']]

These clusters are coherent as they tend towards the study plans of certain sections or their respective specializations. Therefore, when a course is input into the recommender system, it will output the courses of the cluster, sorted by importance with respect to the original course. The relative importance is given by the distance between the course of interest and the recommendations.



In [ ]:

4.1.2. Hard Clustering With GMM

K-Means clustering is a particular case of Gaussian Mixture Models with isotropic covariance matrices. There is no guarantee that the distribution of nodes in the course space follows an isotropic distribution. That is why Gaussian Mixture Models were tested to see whether the possibility of having full covariance matrices and eventually the possiblity of having soft clustering would lead to better outputs.

Using the GMM implemented in scikit-learn it was possible to do unsupervised clustering on the courses and obtain the output class labels corresponding to the most likely class for each course.

Selecting the optimal k was done using AIC and BIC metrics which evaluate the quality of the model for a given K. These metrics are therefore computed for different values of k and the one which leads to the minimum AIC / BIC is selected. The optimal k given by the BIC is 16 as opposed to 24 with AIC.

AIC metric evolution for different k

BIC metric evolution for different k

Using k=16 lead to better outputs than with k=24 which is why it was maintained.

The clusters obtained are the following :

Cluster 1 related to material science: ['Mécanique vibratoire', 'Conception pour X', 'Composants semiconducteurs', 'Technologie des microstructures I', 'Micro and nanostructuration of materials', 'Materials selection', 'Recycling of materials', 'Assembly techniques', 'Wood structures, properties and uses', 'Polymer morphological characterization techniques', 'Physical chemistry of polymeric materials', 'Properties of semiconductors and related nanostructures', 'Dielectric properties of materials', 'Organic electronic materials - synthesis, applications, properties', 'Atomistic and quantum simulations of materials', 'Introduction to crystal growth by epitaxy']

Cluster 2 related to the micro-electronics specialization in EE: ['Optical detectors', 'Nanoelectronics', 'Analog circuits design I', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Physical models for micro and nanosystems', 'Modeling of emerging electron devices', 'Integrated circuits technology', 'Lab in EDA based design', 'Hardware systems modeling', 'Fundamentals of VLSI design', 'Lab in microelectronics', 'Introduction to VLSI Design']

Cluster 3 related to the MNS orientation in MT:['Materials and technology of microfabrication', 'Microelectronics', 'Scaling laws in micro- and nanosystems', 'Analyse de produits et systèmes', 'Laser microprocessing', 'Large-area electronics: devices and materials', 'Advanced MEMS', 'Nanotechnology', 'Circuits intégrés I', 'IC design II']

Cluster 4 related to the bio-engineering curriculum : ['Commande de procédés', 'Biotechnology lab (for CGC)', 'Introduction to cellular and molecular biotechnology', 'Pharmaceutical biotechnology']

Cluster 5 related to bio-enginering curriculum : ['Mécanique des structures (pour GM)', 'Signal processing for functional brain imaging', 'Brain computer interaction', 'Data analysis and model classification', 'Materials Science', 'Fundamentals of neuroengineering', 'Tissue engineering', 'Sensorimotor neuroprosthetics', 'Analysis and modelling of locomotion', 'Scientific literature analysis in Neuroscience', 'In Silico neuroscience']

Cluster 6 related to the energy specialization in EE: ['Power systems dynamics', 'Industrial electronics I', 'Industrial electronics II', 'Power system restructuring and deregulation', 'Wave propagation along transmission lines', 'Multivariable control and coordination systems', 'Electromagnetic compatibility', 'Hydropower plants: generating and pumping units', 'Semester project for specialisation C', 'Smart grids technologies', 'Energy storage systems', 'Advanced lab in electrical energy systems']

Cluster 7 related to the information technology specialization in EE: ['Propagation of acoustic waves', 'Microwaves', 'Systems and architectures for signal processing', 'Image and video processing', 'Audio', 'Optical communications', 'Photonic systems and technology', 'Image communication', 'Image analysis and pattern recognition', 'Speech processing', 'Biomedical signal processing', 'Wireless receivers: algorithms and architectures', 'Lab in signal and image processing', 'Lab on app development for tablets and smartphones', 'Lab in microwaves', 'Lab in acoustics', 'Mathematics of data: from theory to computation']

Cluster 8 related to the robotics curriculum in MT: ['System identification', 'Methods for rapid production and development', 'Fabrication assistée par ordinateur', 'Commande non linéaire', 'Computer-aided engineering', 'Fracture mechanics', 'Dynamique numérique des solides et des structures', 'Multi-body simulation', 'Production management', 'Simulation and optimisation of industrial applications', 'Advanced control systems', 'Systèmes mécatroniques', 'Conception mécanique intégrée', 'Model predictive control', 'Lifecycle performance of product systems', 'Mechanics of composites', 'Robotics practicals', "Commande d'actionneurs à l'aide d'un microprocesseur + TP", 'Mobile robots', 'Capteurs', 'Bases de la robotique', 'Robotique industrielle et appliquée', 'Haptic human robot interfaces', 'Advanced machine learning', 'Transducteurs et entraînements intégrés', 'Composites polymères + TP']

Cluster 9 related to fluid dynamics and mechanics : ['Two-phase flows and heat transfer', 'Turbomachines thermiques', 'Hydraulic turbomachines', 'Numerical flow simulation', 'Numerical methods in heat transfer', 'Hydrodynamique acoustique', 'Hydrodynamics', 'Modelling and optimization of energy systems', 'Advanced energetics', "Cavitation et phénomènes d'interface", 'Engines and fuel cells', 'Renewable energy (for ME)', 'Thermal power cycles and heat pump systems', 'Particle-based methods', 'Aéroélasticité et interaction fluide-structure', 'Aerodynamics', 'Introduction to nuclear engineering', 'Advanced heat transfer', 'Instability', 'Turbulence']

Cluster 10 related to the Optics orientation in MT : ['Lasers: theory and modern applications', 'Optics laboratories I', 'Integrated optics', 'Image optics', 'Optical waves propagation']

Cluster 11 related to the bio-engineering curriculum : ['Lab methods : histology', 'Lab methods : bioactive compounds screening', 'Lab methods : animal experimentation', 'Lab methods : biosafety', 'Lab methods : flow cytometry', 'Lab methods : proteomics']

Cluster 12 related to the micro-electronics specialization in EE: ['Hardware systems modeling I', 'Advanced analog and RF integrated circuits design I', 'HF and VHF circuits and techniques II', 'Advanced analog and RF integrated circuits design II', 'Analog circuits design II', 'Advanced VLSI design', 'Advanced lab in electrical engineering', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Bio-nano-chip design', 'Analog circuits for biochip']

Cluster 13 related to the minor in space: ['Spacecraft design and system engineering', 'Conception de mécanismes spatiaux : introduction', 'Lessons learned from the space exploration', 'Spacecraft avionics architectures', 'Introduction to planetary sciences']

Cluster 14 related to the MT and MX curriculum:['Biomechanics of the musculoskeletal system', 'Biomechanics of the cardiovascular system', 'Numerical methods in biomechanics', 'Image processing I', 'BioMEMS', 'Biomicroscopy I', 'Image processing II', 'Biomicroscopy II', 'Fundamentals & processes for photovoltaic devices', 'Seminar in physiology and instrumentation', 'Flexible bioelectronics', 'Applied machine learning', 'Automatic speech processing', 'Space mission design and operations', 'Sensors in medical instrumentation', 'Fundamentals of biosensors and electronic biochips', 'Biomaterials (pour MX)', 'Advanced nanomaterials', 'Tribology', 'Composites technology', 'Thin film fabrication processes', 'Life cycle engineering of polymers', 'Cementitious materials (advanced)', 'Matériaux pierreux et conservation du patrimoine bâti', 'Powder technology', 'Electron microscopy: advanced methods', 'Introduction to nanomaterials', 'Polymer chemistry and macromolecular engineering', 'Electrochemistry for materials technology', 'Fundamentals of solid-state materials', 'Soft matter', 'Modelling problem solving, computing and visualisation I', 'Fracture of materials', 'Biomedical optics', 'Biomaterials', 'Chemical biology - tools and methods', 'Advanced bioengineering methods laboratory']

Cluster 15 : ['Rayonnement et antennes', 'Media security']

Cluster 16 : ['Social media']

4.1.3. Soft Clustering With GMM

Courses are not necessarily part of one curriculum or one set of relevant courses. That is why soft clustering was considered for the recommendation. This gives more flexibility to the recommender system and could potentially improve the quality and relevance of the outputs.

Using the baseline graph (enrolments), the results obtained lead to hard clustering, this is due to the fact that the gaussians did not overlap sufficiently. However, as it can be seen below, the obtained clusters were different:

Cluster 1 related to the robotics specialization in MT: ['Model predictive control', 'Image processing I', 'Robotics practicals', "Commande d'actionneurs à l'aide d'un microprocesseur + TP", 'Mobile robots', 'Image processing II', 'Signal processing for functional brain imaging', 'Haptic human robot interfaces', 'Advanced machine learning', 'Transducteurs et entraînements intégrés', 'Automatic speech processing', 'Scientific literature analysis in Neuroscience', 'In Silico neuroscience']

Cluster 2 related to the bio-engineering curriculum : ['Biomechanics of the musculoskeletal system', 'Biomechanics of the cardiovascular system', 'Numerical methods in biomechanics', 'Mécanique des structures (pour GM)', 'BioMEMS', 'Biomicroscopy I', 'Biomicroscopy II', 'Seminar in physiology and instrumentation', 'Flexible bioelectronics', 'Applied machine learning', 'Sensors in medical instrumentation', 'Fundamentals of biosensors and electronic biochips', 'Data analysis and model classification', 'Introduction to nanomaterials', 'Materials Science', 'Biomedical optics', 'Biomaterials', 'Chemical biology - tools and methods', 'Advanced bioengineering methods laboratory', 'Fundamentals of neuroengineering', 'Tissue engineering', 'Sensorimotor neuroprosthetics', 'Analysis and modelling of locomotion']

Cluster 3 related to the MNS orientation in MT: ['Materials and technology of microfabrication', 'Microelectronics', 'Scaling laws in micro- and nanosystems', 'Analyse de produits et systèmes', 'Laser microprocessing', 'Large-area electronics: devices and materials', 'Advanced MEMS', 'Nanotechnology', 'Circuits intégrés I', 'IC design II']

Cluster 4 related to the MX curriculum: ['Biomaterials (pour MX)', 'Advanced nanomaterials', 'Tribology', 'Composites technology', 'Thin film fabrication processes', 'Life cycle engineering of polymers', 'Cementitious materials (advanced)', 'Matériaux pierreux et conservation du patrimoine bâti', 'Powder technology', 'Electron microscopy: advanced methods', 'Polymer chemistry and macromolecular engineering', 'Electrochemistry for materials technology', 'Fundamentals of solid-state materials', 'Soft matter', 'Modelling problem solving, computing and visualisation I', 'Fracture of materials']

Cluster 5 related to the Optics orientation in MT : ['Lasers: theory and modern applications', 'Optics laboratories I', 'Integrated optics', 'Image optics', 'Optical waves propagation']

Cluster 6 related to the energy specialization in EE: ['Power systems dynamics', 'Industrial electronics I', 'Industrial electronics II', 'Power system restructuring and deregulation', 'Wave propagation along transmission lines', 'Multivariable control and coordination systems', 'Electromagnetic compatibility', 'Hydropower plants: generating and pumping units', 'Semester project for specialisation C', 'Smart grids technologies', 'Energy storage systems', 'Advanced lab in electrical energy systems']

Cluster 7 related to the ME curriculum: ['System identification', 'Methods for rapid production and development', 'Fabrication assistée par ordinateur', 'Mécanique vibratoire', 'Commande non linéaire', 'Computer-aided engineering', 'Fracture mechanics', 'Dynamique numérique des solides et des structures', 'Multi-body simulation', 'Production management', 'Simulation and optimisation of industrial applications', 'Advanced control systems', 'Systèmes mécatroniques', 'Conception mécanique intégrée', 'Lifecycle performance of product systems', 'Mechanics of composites', 'Capteurs', 'Bases de la robotique', 'Robotique industrielle et appliquée', 'Composites polymères + TP']

Cluster 8 related to the micro-electronics specialization in EE: ['Hardware systems modeling I', 'Advanced analog and RF integrated circuits design I', 'HF and VHF circuits and techniques II', 'Advanced analog and RF integrated circuits design II', 'Analog circuits design II', 'Advanced VLSI design', 'Advanced lab in electrical engineering', 'Advanced wireless communications: algorithms and architectures', 'Bio-nano-chip design', 'Analog circuits for biochip']

Cluster 9 related to the bio-engineering curriculum : ['Commande de procédés', 'Biotechnology lab (for CGC)', 'Introduction to cellular and molecular biotechnology', 'Pharmaceutical biotechnology']

Cluster 10 related to the minor in space: ['Spacecraft design and system engineering', 'Conception de mécanismes spatiaux : introduction', 'Lessons learned from the space exploration', 'Spacecraft avionics architectures', 'Introduction to planetary sciences']

Cluster 11 related to fluid dynamics and mechanics : ['Two-phase flows and heat transfer', 'Turbomachines thermiques', 'Hydraulic turbomachines', 'Numerical flow simulation', 'Numerical methods in heat transfer', 'Hydrodynamique acoustique', 'Hydrodynamics', 'Modelling and optimization of energy systems', 'Advanced energetics', "Cavitation et phénomènes d'interface", 'Engines and fuel cells', 'Renewable energy (for ME)', 'Thermal power cycles and heat pump systems', 'Particle-based methods', 'Aéroélasticité et interaction fluide-structure', 'Aerodynamics', 'Introduction to nuclear engineering', 'Advanced heat transfer', 'Instability', 'Turbulence', 'Fundamentals & processes for photovoltaic devices', 'Space mission design and operations']

Cluster 12 related to the micro-electronics specialization in EE: ['Optical detectors', 'Nanoelectronics', 'Analog circuits design I', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Physical models for micro and nanosystems', 'Modeling of emerging electron devices', 'Integrated circuits technology', 'Lab in EDA based design', 'Hardware systems modeling', 'Fundamentals of VLSI design', 'Lab in microelectronics', 'Introduction to VLSI Design']

Cluster 13 related to the information technology specialization in EE: ['Propagation of acoustic waves', 'Microwaves', 'Systems and architectures for signal processing', 'Image and video processing', 'Audio', 'Optical communications', 'Photonic systems and technology', 'Image communication', 'Image analysis and pattern recognition', 'Speech processing', 'Biomedical signal processing', 'Brain computer interaction', 'Electrical filters', 'Wireless receivers: algorithms and architectures', 'Lab in signal and image processing', 'Lab on app development for tablets and smartphones', 'Lab in microwaves', 'Lab in acoustics', 'Mathematics of data: from theory to computation']

Cluster 14 related to the bio-engineering curriculum : ['Lab methods : histology', 'Lab methods : bioactive compounds screening', 'Lab methods : animal experimentation', 'Lab methods : biosafety', 'Lab methods : flow cytometry', 'Lab methods : proteomics']

Cluster 15 related to material science: ['Conception pour X', 'Composants semiconducteurs', 'Technologie des microstructures I', 'Micro and nanostructuration of materials', 'Materials selection', 'Recycling of materials', 'Assembly techniques', 'Wood structures, properties and uses', 'Polymer morphological characterization techniques', 'Physical chemistry of polymeric materials', 'Properties of semiconductors and related nanostructures', 'Dielectric properties of materials', 'Organic electronic materials - synthesis, applications, properties', 'Atomistic and quantum simulations of materials', 'Introduction to crystal growth by epitaxy']

Cluster 16: ['Rayonnement et antennes', 'Media security', 'Social media']

4.1.4. Recommendations Using Clustering

Given the clusters output by the methods detailed previously, the recommendations are done as follows :

Given a single course

the corresponding cluster or clusters (in the case of the soft clustering) is identified
the courses in the cluster(s) are output based on relative importance. The relative importance is given by the distance to the course of interest.

Given multiple courses

take the cluster which is the most represented in the list of courses of interest

The recommendation function is given in the following cell.



In [6]:

    
def suggest_wrt1(course):
    
    # Deal with the case course is a string and not a list of strings:
    if type(course)==str:
        course=[course]

    print('Results with K-means:\n')
    L=[]
    indices1=[]
    for q in range(len(course)):
        L.append(labels_1[courses_index_dico[course[q]]]) #list the labels of the courses in course
        indices1.append(courses_index_dico[course[q]]) #indices of the chosen course of course

    L=Counter(L).most_common(1)[0][0] #keep the label L of the most common cluster
    ind=np.where(labels_1==L)[0] # get the indices of the courses of the cluster
    indices1=np.array(indices1) 
    indices=list(indices1[list(np.isin(indices1, ind))]) # indices of the chosen courses that are in the most common cluster

    # Compute the distances between the courses of the chosen cluster and the remaining chosen courses:
    dist=scipy.spatial.distance.cdist(F1_K[ind,:],F1_K[indices,:], 'euclidean')
    dist=np.mean(dist, axis=1)
    
    # Output the list of courses in the cluster ordered with respect to the closest distance with the chosen courses:
    ind=ind[np.argsort(dist)]    
    print(Courses[ind])
    print([courses2[courses2.index.str.endswith(Courses[ind][p])].CourseTitleFR.tolist()[0] for p in range(len(ind))])
    print('\n')
    
    print('\nResults with GMM soft-clustering:\n')
    if len(course)==1:
        L=multi_labels[courses_index_dico[course[0]]] #keep the labels L of the clusters
    else:
        L=[item for q in range(len(course)) for item in multi_labels[courses_index_dico[course[q]]]]
        L=[Counter(L).most_common(1)[0][0]] #keep the label L of the most common cluster
        
    for p in range(len(L)):
        ind=np.array(labels_matrix[L[p]]) # get the indices of the courses of the cluster
        indices=list(indices1[list(np.isin(indices1, ind))]) # indices of the chosen courses that are in the most common cluster
        
        # Compute the distances between the courses of the chosen cluster and the remaining chosen courses:
        dist=scipy.spatial.distance.cdist(F1_gmm_s[ind,:],F1_gmm_s[indices,:], 'euclidean')
        dist=np.mean(dist, axis=1)
        
        # Output the list of courses in the cluster ordered with respect to the closest distance with the chosen courses:
        ind=ind[np.argsort(dist)]
        print(Courses[ind])
        print([courses2[courses2.index.str.endswith(Courses[ind][p])].CourseTitleFR.tolist()[0] for p in range(len(ind))])
        print('\n')

    print('\nResults with GMM hard-clustering:\n')
    L=[]
    for q in range(len(course)):
        L.append(labels_1c[courses_index_dico[course[q]]])
    L=Counter(L).most_common(1)[0][0] #keep the label L of the most common cluster
    ind=np.where(labels_1c==L)[0] # get the indices of the courses of the cluster
    indices=list(indices1[list(np.isin(indices1, ind))]) # indices of the chosen courses that are in the most common cluster

    # Compute the distances between the courses of the chosen cluster and the remaining chosen courses:
    dist=scipy.spatial.distance.cdist(F1_gmm_h[ind,:],F1_gmm_h[indices,:], 'euclidean')
    dist=np.mean(dist, axis=1)
    
    # Output the list of courses in the cluster ordered with respect to the closest distance with the chosen courses:
    ind=ind[np.argsort(dist)]
    print(Courses[ind])
    print([courses2[courses2.index.str.endswith(Courses[ind][p])].CourseTitleFR.tolist()[0] for p in range(len(ind))])

For the set of courses ['Nanoelectronics', 'Analog circuits design I'], the recommandations gave the following outputs:

Results with K-means: ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'Lab in EDA based design', 'Test of VLSI systems', 'HF and VHF circuits and techniques I', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Hardware systems modeling I', 'Hardware systems modeling', 'Advanced analog and RF integrated circuits design I', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Analog circuits design II', 'Bio-nano-chip design', 'Analog circuits for biochip', 'Advanced VLSI design', 'Advanced analog and RF integrated circuits design II', 'Lab in microelectronics', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Advanced lab in electrical engineering']

Results with GMM soft-clustering: ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']

Results with GMM hard-clustering: ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']

Assessment

As it can be seen, the outputs are different: here, the results of GMM for hard and soft clustering are the same here but they are different from the results of k-means. However, the recommandations seem to be quite good for this set of courses.

NB: A part of randomness was observed in the clustering based on GMM leading to different clusters but the reason of this could not be figured out since the initialization of the algorithm is done using K-Means which should always lead to the same output.

In order to take into consideration the graph of requirements, all these three methods were repeated for a graph obtained thanks to the combinaison of the graphs based on the students' choices (enrolments) and the requirements. Two output examples of the recommendation based on this new graph can be seen in the table of part 4.3 where it is compared to the diffusion method. This new graph was obtained in the same manner as in the Graph Diffusion part (normalization and weighted sum).

To test other courses in the recommendation functions, it is sufficient to re-run the notebook Spectral Clustering notebook and try different courses in the parts 8.1 and 8.2 of this notebook. Note that the course descriptions and enrolments dataframes must first be generated by running the Data Cleaning notebook after downloading the dataset as it was not possible to upload the dataset onto moodle (given its size).

4.2. Graph Diffusion

Using graph diffusion, the objective was to diffuse a dirac from a select number of nodes given by the courses of interest. Diffusing from there and sorting the courses based on the resulting signal from highest to lowest output would then give the most relevant courses. This was done in the Weighting Metrics and Graph Diffusion notebook using the following functions :



In [7]:

    
"""
Function used to diffuse a signal from a given number of nodes using a heat
diffusion filter with the desired tau. Note that tau should not be too small
in order to diffuse and not too high to avoid having a uniform signal
over the entire graph. 
"""
def heat_diffusion(G, courses, tau):
    # Create the heat diffusion filter
    filt = filters.Heat(G, tau)
    
    # Create the signal for the given graph
    signal = np.zeros(G.N)

    for course in courses:
        NODE = np.where(np.asarray(full_courses_list) == course)[0]
        signal[NODE] = 1

    # Apply the filter to the signal
    filtered_s = filt.filter(signal)

    return filtered_s

"""
This function calls the graph diffusion for valid course codes 
and keeps the most relevant courses for the recommendation. 
The output is the list of courses in descending order. 
"""
def diffusion(weight_mat, list_loved_courses, n_result_courses, tau_filter):
    # Define the index of the loved courses to hgighlight them later.
    NODE = []
    
    for i in range(0,len(list_loved_courses)):
        if (len(np.where(np.asarray(full_courses_list) == list_loved_courses[i])[0])==0):
            print("ERROR! Course loved is not in the list of the courses.")
            return
        NODE.append(np.where(np.asarray(full_courses_list) == list_loved_courses[i])[0][0])
    
    # Create the graph and do the diffusion on it.
    G_diffusion = create_graph(weight_mat)
    filtered_signals = heat_diffusion(G_diffusion,list_loved_courses,tau_filter)
    
    # Plot the diffusion
    G_diffusion.set_coordinates("spring")#G_diffusion.U[:,1:3])
    G_diffusion.plot_signal(filtered_signals, vertex_size=50, highlight = NODE, )
    
    # Create the list of courses ordered with their values found by the diffusion. 
    filtered_signals_int = list(filtered_signals)
    courses_list = []
    
    if(n_result_courses > len(filtered_signals_int)):
        n_result_courses = len(filtered_signals_int)
        
    for i in range(0,n_result_courses):
        course_code = full_courses_list[filtered_signals_int.index(max(filtered_signals_int))]
        courses_list.append(courses[courses.index.str.endswith(course_code)].CourseTitleFR.tolist()[0])
        filtered_signals_int[filtered_signals_int.index(max(filtered_signals_int))] = -1
    
    return courses_list

"""
# Example of usage : note this is extracted from the original notebook
weight_different_graph = [0.2,0,0,0,0,0,1] # weights used for the different graphs

# Combine the different graphs
diffusion_graph = weight_different_graph[0]*weight_matrices[0]
for i in range(1, len(weight_matrices)):
    diffusion_graph = diffusion_graph + weight_different_graph[i]*weight_matrices[i]

# Call the recommendation
recommended_courses = diffusion(diffusion_graph,["MICRO-421", "MICRO-520"],7,4)
"""
print()

Originally, the idea was to do supervised learning to do a grid search in order to :

tune the weighting coefficients between the different graphs
selecting / deisigning the best filter : test different low pass filters with varying hyper-parameters. It was important to use a low pass filter to keep low frequencies which attribute more importance to neighbors.

Unfortunately it was not possible to conduct a grid search as there was no ground truth upon which to assess the results to automatically select the best combination. One idea was to use the probabilities of two courses being taken which is computed in Computing Course Probabilities notebook, but this would have overfit on the enrolments graph as it is based on the same data. Another alternative was to rely on a new database of courses taken by students currently finishing their masters and adding to it the courses that they would have liked to take as well (see the Performance Evaluation notebook), this unfortunately was not sufficient in order to construct a ground truth. As a matter of fact, this application would have benefitted from the knowledge of the sutdents' assessment of each course but this data was not available. Therefore without a metric to assess the results, personal knowledge was used to assess the outputs and optimize the coefficients.

4.2.1. Recommendations For Multiple Courses

Using the baseline graph, and given a list of courses of interest, a dirac is diffused using a heat filter from the different nodes. An example output can be seen in the following.

Graph Diffusion From Automatic Speech Processing (EE-554) and Applied Machine Learning (MICRO-455) visualized on the eigenvectors of the normalized Laplacian

Graph Diffusion From Automatic Speech Processing (EE-554) and Applied Machine Learning (MICRO-455) visualized on a spring format

Outputting the top 15 corresponding recommendations given these two courses the following courses are obtained: 'Applied machine learning', 'Automatic speech processing', 'Advanced lab in electrical engineering', 'Composants semiconducteurs', 'Introduction to planetary sciences', 'Space mission design and operations', 'Wave propagation along transmission lines', 'Mécanique vibratoire', 'Lasers: theory and modern applications', 'Lab in microelectronics', 'Scaling laws in micro- and nanosystems', 'Advanced lab in electrical energy systems', 'Fracture mechanics', 'Social media', 'Image and video processing'

Applied Machine Learning is a course given mostly for micro-engineering sutdents whereas automatic speech processing is an electrical engineering course. That's why in the list presented before, the courses are ones given in these two sections. It is logical as this has been found using the student graph. However, the subjects of the courses are not really the same. For example, Composants semiconducteurs is a micro-engineering course about transistors which is not really related to the machine learning and speech processing fields. Using this graph to do recommendation seems interesting in the sense that it proposes courses of the same section, but other graphs need to be considered to obtain courses which are more related in terms of topics covered during the class.

4.2.2. Weighting the different graphs

As mentioned previously, it is important to consider different graphs during the recommendation process. The graphs were summed using weights to adjust the importance of each of them.

Looking at the outputs when combining with the different requirements graphs while diffusing from applied machine learning the results are the following (with a tau of 4):

Enrolments Weight	1	0	1e-11	1e-11	1e-11	1e-11
Assistants Weight	0	1	0	0	0	0
Professors Weight	0	0	1	0	0	0
Courses With Same Requirements Weight	0	0	0	1	0	0
Course To Its Requirements Weight	0	0	0	0	1	0
Requirements of same course Weight		0	0	0	0	1
Outputs	'Applied machine learning', 'Advanced lab in electrical engineering', 'Composants semi-conducteurs', 'Introduction to planetary sciences', 'Lasers: theory and modern applications', 'Wave propagation along transmission lines', 'Mécanique vibratoire'	'Applied machine learning', 'Analysis and modelling of locomotion', 'Introduction to cellular and molecular biotechnology', 'Biotechnology lab (for CGC)', 'Pharmaceutical biotechnology', 'Biomaterials', 'Advanced bioengineering methods laboratory'	'Applied machine learning', 'Turbomachines thermiques', 'Hydrodynamics', 'Scaling laws in micro- and nanosystems', 'Composants semiconducteurs', 'Biomechanics of the musculoskeletal system', 'Lifecycle performance of product systems'	'Applied machine learning', 'Lasers: theory and modern applications', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Introduction to planetary sciences', 'Space mission design and operations', 'Lab in microelectronics'	'Applied machine learning', 'Advanced machine learning', 'Lasers: theory and modern applications', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Lab in microelectronics', 'Introduction to planetary sciences'	'Applied machine learning', 'Advanced machine learning', 'Automatic speech processing', 'Production management', 'Image processing II', 'Biomicroscopy I', 'Biomicroscopy II'

Looking at the results obtained from the graph diffusion from Applied machine learning :

The professor and assistant's graph do not lead to good results as the first recommendation is Analysis and Modelling of Locomotion for the assistants and Turbomachines thermiques which has nothing to do with the cursus, the following recommendations are even less coherent
The courses with same requirements graph: the results are related to the same section but there are many other courses which would more pertinent.
Course to its requirements graph : the follow up course to applied machine learning is recommended however the results quickly diverge. generally speaking the first results are generally sufficient using this graph, taking a larger list of recommendations usually steers quickly of topic.
requirements of same course graph : this graph is the only one which outputs courses which are relevent such as automatic speech processing, advanced machine learning and image processing which could in fact be taken by a student with interests in these domains.

Given all of this, only the graph linking requirements of the same course is maintained for the final application. This graph will be designated by requirements graph for the following sections. A general remark concerning the requirements graph is that only the immediate neighbours of the nodes are pertinent, looking farther away from the node often leads to courses which are irrelevant, as mentioned previously.

The next step was then to find the optimal weights to combine the remaining graphs. The issue is that the relevance of the results depend highly on the courses. The outputs were tested for different combinations.

Courses	Enrolment	Requirement	Result
Applied Machine Learning	0.15	1	['Applied machine learning','Advanced machine learning','Automatic speech processing','Biomicroscopy II','Biomicroscopy I','Image optics','Lasers: theory and modern applications']
Applied Machine Learning	0.2	1	['Applied machine learning','Advanced machine learning','Lasers: theory and modern applications','Advanced lab in electrical engineering','Wave propagation along transmission lines','Introduction to planetary sciences','Space mission design and operations']
Applied Machine Learning	0.3	1	['Applied machine learning','Advanced lab in electrical engineering','Lasers: theory and modern applications','Wave propagation along transmission lines','Introduction to planetary sciences','Space mission design and operations','Mécanique vibratoire']
Lab on app development for tablets and smartphones	0.2	1	['Lab on app development for tablets and smartphones','Semester project for specialisation C','Advanced analog and RF integrated circuits design I','Image communication','Lab in signal and image processing','HF and VHF circuits and techniques II','Mathematics of data: from theory to computation']
Automatic speech processing	0.2	1	['Automatic speech processing','Advanced machine learning','Applied machine learning','Biomicroscopy I','Biomicroscopy II','Image optics','Optical communications']

In the three first lines of the table, Applied Machine Learning was used as the course of interest. The process to tune the weights was first to set the weights to 1 for both graphs then vary the weight of the enrolments graph. Doing so leads to varying results, only if the weight of the graphs are different.

For Applied Machine Learning :

enrolment weight = 0.15 : the influence of the requirements graph was too high as the results were identical to the ones obtained solely with the corresponding requirement graph
enrolment weight = 0.2 : there is a better balance and the results is a combination of the outputs obtained with the two graphs
enrolment weights = 0.3 : the influence of the enrolment's graph was too strong and nearly disregarded the requirements information.

Therefore the best weight would seem to be 0.2 for the Applied Machine Learning course. Testing this combination for different courses :

Lab on App Development for Smarphones and Tablets : the results are satisfying and seem coherent.
Automatic Speech Processing : the influence of the requirement is too hight which would indicate that the weighting coefficient needs to be adapted for the different courses.

Testing the outputs using multiple input courses and outputting only the top 7 (with the first two corresponding to the input courses) :

Courses	Results
Applied Machine Learning + Space Mission Design and Operations	['Space mission design and operations', 'Applied machine learning', 'Introduction to planetary sciences', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Lasers: theory and modern applications', 'Mécanique vibratoire']
Nanoelectronics + Analog Circuit Design I	['Analog circuits design I', 'Nanoelectronics', 'HF and VHF circuits and techniques I', 'Lab in EDA based design', 'Lab in acoustics', 'Analog circuits design II', 'Industrial electronics I']
Image Optics + Laser Microprocessing	['Image optics','Biomicroscopy II','Biomicroscopy I','Laser microprocessing','Capteurs','Image processing II','Composites technology']

Generally speaking the recommendation system seems to perform well, there are of course certain adjustments which need to be made but the courses are for the most case relevant and could indeed be taken by a student given the topics of interest.

4.3. Assessment Of The Different Methods

It appears interesting to compare the results obtained with the spectral clustering and the ones obtained by the diffusion. To do so, a table is presented which summarizes the results found with every method on the student graph and the mixed graph for the course Automatic Speech Processing:

Method used for the course Automatic Speech Processing	Results
K-means on the student graph	['Automatic speech processing', 'Biomedical signal processing', 'Microwaves', 'Systems and architectures for signal processing', 'Speech processing', 'Image analysis and pattern recognition', 'Lab in microwaves', 'Propagation of acoustic waves', 'Photonic systems and technology', 'Lab in signal and image processing', 'Wireless receivers: algorithms and architectures', 'Image and video processing', 'Image communication', 'Optical communications', 'Rayonnement et antennes', 'Lab in acoustics', 'Brain computer interaction', 'Lab on app development for tablets and smartphones', 'Social media', 'Audio', 'Mathematics of data: from theory to computation', 'Media security']
GMM soft-clustering on the student graph	['Automatic speech processing', 'Image processing I', 'Image processing II', 'Model predictive control', 'Transducteurs et entraînements intégrés', 'Mobile robots', "Commande d'actionneurs à l'aide d'un microprocesseur + TP", 'Signal processing for functional brain imaging', 'Robotics practicals', 'Haptic human robot interfaces', 'Advanced machine learning', 'In Silico neuroscience', 'Scientific literature analysis in Neuroscience']
GMM hard-clustering on the student graph	['Automatic speech processing', 'Image processing I', 'Image processing II', 'Sensors in medical instrumentation', 'Applied machine learning', 'Biomechanics of the cardiovascular system', 'Biomaterials (pour MX)', 'Biomicroscopy II', 'Fundamentals of biosensors and electronic biochips', 'Flexible bioelectronics', 'Biomechanics of the musculoskeletal system', 'Biomicroscopy I', 'Seminar in physiology and instrumentation', 'BioMEMS', 'Biomaterials', 'Advanced nanomaterials', 'Space mission design and operations', 'Biomedical optics', 'Composites technology', 'Numerical methods in biomechanics', 'Life cycle engineering of polymers', 'Tribology', 'Electrochemistry for materials technology', 'Thin film fabrication processes', 'Powder technology', 'Electron microscopy: advanced methods', 'Cementitious materials (advanced)']
K-mean on the combined graph	['Automatic speech processing', 'Biomedical signal processing', 'Microwaves', 'Systems and architectures for signal processing', 'Wireless receivers: algorithms and architectures', 'Image analysis and pattern recognition', 'Lab in microwaves', 'Speech processing', 'Photonic systems and technology', 'Mathematics of data: from theory to computation', 'Lab in signal and image processing', 'Image and video processing', 'Optical communications', 'Image communication', 'Social media', 'Rayonnement et antennes', 'Lab in acoustics', 'Media security', 'Brain computer interaction', 'Audio']
GMM soft-clustering on the combined graph	['Automatic speech processing', 'Biomedical signal processing', 'Image analysis and pattern recognition', 'Systems and architectures for signal processing', 'Microwaves', 'Speech processing', 'Photonic systems and technology', 'Lab in signal and image processing', 'Image and video processing', 'Lab in microwaves', 'Image communication', 'Wireless receivers: algorithms and architectures', 'Rayonnement et antennes', 'Optical communications', 'Mathematics of data: from theory to computation', 'Media security', 'Lab in acoustics', 'Social media']
GMM hard-clustering on the combined graph	['Automatic speech processing', 'Biomedical signal processing', 'Image analysis and pattern recognition', 'Systems and architectures for signal processing', 'Microwaves', 'Speech processing', 'Photonic systems and technology', 'Lab in signal and image processing', 'Image and video processing', 'Lab in microwaves', 'Image communication', 'Wireless receivers: algorithms and architectures', 'Rayonnement et antennes', 'Optical communications', 'Mathematics of data: from theory to computation', 'Brain computer interaction', 'Media security', 'Lab in acoustics', 'Social media', 'Audio']
Diffusion on the student graph	['Automatic speech processing','Image and video processing','Sensors in medical instrumentation','Lessons learned from the space exploration','Bio-nano-chip design','Introduction to VLSI Design','Lab in microwaves']
Diffusion on the combined graph	['Automatic speech processing','Advanced machine learning','Applied machine learning','Biomicroscopy I','Biomicroscopy II','Image optics','Optical communications']

Comparing the spectral clustering methods with the diffusion, the courses proposed are not the same. However, they remain in the same field in the sense that most of them are given by the electrical engineering faculty, such as "Automatic Speech Processing". It is not really possible to conclude whether a method is better than another as there is no information regarding how the students liked the different courses. It is not possible to determine which courses should appear at the top of the recommendations' list using Automatic Speech Processing. It appears also interesting to analyze the recommendation results for multiple courses. The following table presents them for the courses Nanoelectronics and Analog circuits design I:

Method used for the course Nanoelectronics + Analog circuits design I	Results
K-means on the student graph	['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'Lab in EDA based design', 'Test of VLSI systems', 'HF and VHF circuits and techniques I', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Hardware systems modeling I', 'Hardware systems modeling', 'Advanced analog and RF integrated circuits design I', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Analog circuits design II', 'Bio-nano-chip design', 'Analog circuits for biochip', 'Advanced VLSI design', 'Advanced analog and RF integrated circuits design II', 'Lab in microelectronics', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Advanced lab in electrical engineering']
GMM soft-clustering on the student graph	['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']
GMM hard-clustering on the student graph	['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']
K-mean on the combined graph	['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Introduction to VLSI Design', 'Hardware systems modeling', 'Lab in EDA based design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Lab in microelectronics', 'Analog circuits design II', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Lasers: theory and modern applications', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters']
GMM soft-clustering on the combined graph	['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Hardware systems modeling', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Analog circuits design II', 'Lab in microelectronics', 'Lab in EDA based design', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Physical models for micro and nanosystems', 'Propagation of acoustic waves', 'Advanced lab in electrical engineering', 'Advanced VLSI design']
GMM hard-clustering on the combined graph	['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Hardware systems modeling', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Analog circuits design II', 'Lab in microelectronics', 'Lab in EDA based design', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Lasers: theory and modern applications', 'Electrical filters']
Diffusion on the student graph	['Analog circuits design I','Nanoelectronics','Lab in EDA based design','Lab in acoustics','Image analysis and pattern recognition','Biomedical optics','Pharmaceutical biotechnology']
Diffusion on the combined graph	['Analog circuits design I','Nanoelectronics','HF and VHF circuits and techniques I','Lab in EDA based design','Lab in acoustics','Analog circuits design II','Industrial electronics I']

The analysis of this table is similar to the one presented before: the courses recommended are electrical engineering courses which is a good point but it is not possible to determine precisely which method is best. One important point to be made is the fact that there is a risk of of overfitting the recommendations to the sections and their study plans. Doing so would lead to recommending courses which are not "original" as results.

5. Conclusion

Given that there was no way of doing supervised learning on the graph, the results could not objectively be assessed based on a ground truth or an objective metric. Personal knowledge of the courses within the STI faculty, more precisely within the Micro-engineering and Electrical engineering curriculums was used to judge the quality of the results which were generally satisfying. There are nonetheless improvements which can be made to graphs which were constructed :

For the professors' graph, as the matrix is quite sparse, another metric would need to be found to link the professors together such as the laboratories, the research domains etc...
For the assistants' graph an exhaustive list of assistants per course, including student assistants and teaching assistants would be required.
For the topics' graph, it would be interesting to find a better way to link the courses based on their content as this information is provided in the course descriptions and has much to offer.

Generally, the recommendation system would have benefitted largely from the knowledge of the student's preferences. Ideally one would have had the rankings of the courses or the course evaluations to know which courses students liked to recommend them to others and avoid the bias caused by the mandatory courses in the curriculum. This unfortunately was not possible as there was no way to access the course evaluations, not forgetting the fact that the evaluations are anonymous and therefore cannot be linked to a given student, as well as the fact that students do not always participate or take them very seriously. That is why it would be interesting to take the project to another level and incorporate a form of reinforcement learning so that the recommender system may take in feedback from the users and use it to improve the outputs.



In [ ]:

	PersonID	SubjectID	Year	Semester	SubjectName	SectionName	StudentSectionCode	StudyPlanCode	CourseCodes	CourseSection	CourseID
0	58e958ea4531b5624c0a827ed0b139ce	62918990845509bd69e684af98230cf2	2007-2008	Master semestre 1	Optics III	Physique	PH	MA3	PHYS-449	PHYS	449
1	ae7eb0574297b5e692272779cd2f6fa0	62918990845509bd69e684af98230cf2	2007-2008	Master semestre 1	Optics III	Physique	PH	MA3	PHYS-449	PHYS	449
2	43bd48d5f43035da625c4db1945307b2	62918990845509bd69e684af98230cf2	2007-2008	Master semestre 1	Optics III	Physique	PH	MA3	PHYS-449	PHYS	449
3	6d1316f513268449837932316ea0d696	62918990845509bd69e684af98230cf2	2007-2008	Master semestre 1	Optics III	Physique	PH	MA3	PHYS-449	PHYS	449
4	f3562b61dc25a46c900bd0b2d078cc91	62918990845509bd69e684af98230cf2	2007-2008	Master semestre 1	Optics III	Physique	PH	MA3	PHYS-449	PHYS	449

	AcademicYear	CourseTitleFR	CourseTitleEN	ProfessorSCIPERs	AssistantSCIPERs	KeyWords_FR	KeyWords_EN	StudyPlans	Requirements	Summary_Concepts_Contents_EN	Summary_Concepts_Contents_FR
CourseCode
AR-301(d)	2016-2017	Théorie et critique du projet BA5 (Fröhlich M....	Studio BA5 (Fröhlich M. & A.)	225189;225190	NaN	[sculpture et expression, habitat et commerce,...	[tradition and modernity, location and context...	AR	[]	[the atelier will examine the programmatic and...	[l'atelier examinera le programme et l'organis...
AR-302(d)	2016-2017	Théorie et critique du projet BA6 (Fröhlich M....	Studio BA6 (Fröhlich M. & A.)	225189;225190	NaN	[]	[]	AR	[]	[the atelier will examine the programmatic and...	[l'atelier examinera le programme et l'organis...
AR-321	2016-2017	Théorie de l'architecture V	Theory of architecture V	263770	NaN	[théorie de l'architecture, société, histoire,...	[theory of architecture, modernity, internatio...	AR	[AR-322, AR-321]	[fluidity, atmosphere, political, the course o...	[le but est d'interpréter et de critiquer de t...
AR-322	2016-2017	Théorie de l'architecture VI	Theory of architecture VI	263770	NaN	[théorie de l'architecture, société, histoire,...	[]	AR	[AR-321]	[the course offer an overview of architectural...	[le but est d'interpréter et de critiquer de t...
AR-323	2016-2017	Histoire de l'architecture V	History of architecture V	178030	NaN	[]	[]	AR	[]	[the course aim to provide critical tool for a...	[avec un accent particulier sur le théories de...